Hypothesis Testing

PSCI 2270 - Week 9

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

October 26, 2023

Plan for this week



  1. 3-Page Proposal and OSF

  2. Hypothesis testing

  3. Discussion of two papers

3-Page Proposal

3-Page Proposal




  • Due: Next Tuesday before class

  • Aim: Brief description of your project that follows final write-up structure

  • Submission: To OSF (add me as collaborator) and post link on Brightspace

Proposal structure


  1. Background and Literature Review: 1-2 paragraphs

  2. Research Question and Theory: 1-2 paragraphs

  3. Setting/Context: 1 paragraph

  4. Independent Variables: 1 paragraph

  5. Dependent Variables (Outcomes of Interest): 1 paragraph

  6. Measurement: 1-2 paragraphs

  7. Possible Issues: 1-3 paragraphs

  • Final project includes more details on everything + estimation procedures with R code if applicable

Reviewing Literature



  • What do we know about topics related to \(X\)?

  • What do scholars think causes \(X\)?

  • What do scholars think \(X\) causes (what are its effects)?

  • Which particular cases do they study?

Relying on Reputation


  • Books from a university press with a reputation

    • Cambridge, Princeton, MIT, Harvard, Cornell, Oxford, Stanford, Michigan
    • Beware of mimicry by bogus presses (Cambridge Scholars Publishing - bad, Cambridge University Press - good)
  • Journals with a reputation for publishing good research

    • American Political Science Review, American Journal of Political Science, Journal of Politics, British Journal of Political Science, Political Science Research and Methods, Quarterly Journal of Political Science, Political Analysis
    • Subfields: International Organization, Journal of Conflict Resolution, International Studies Quarterly, Journal of Peace Research, World Politics, Comparative Political Studies
    • In Economics: American Economic Review, Quarterly Journal of Economics, Econometrica, Journal of Political Economy
    • Especially helpful: Annual Review of Political Science, Journal of Economic Literature

Reading Yourself

Abstract: Short summary; make sure you understand this!

Introduction:
  1. The questions the paper will try to answer
  2. Why it’s important to know those answers
  3. A summary of what the answers are and how they were found

Theory:
  1. The outcome variable (thing to be explained or measured)
  2. The independent variables (things that explain the outcome)
  3. Hypotheses about measures of, or effects on, the outcome

Data/Methods:
  1. How and what data are collected
  2. How variables are measured using these data
  3. Technique(s)/method(s) used to test the hypotheses

Results:
  1. Do estimated relationships correspond with the hypotheses?
  2. Statistical and substantive significance of estimates
  3. Checks of alternative explanations

Conclusion: Broader implications for the field of study

Appendix/Replication archive: Usually online; all details needed to verify the procedures and results, and possibly to replicate them

Let’s Look over the Example


Open Science Framework



  • Resource for storing research information

  • Each project is stored in a repository with version control

  • Main storage for pre-analysis plans for projects

  • Let’s go and create a project: osf.io

Hypothesis testing

Statistical hypothesis testing

  • Statistical hypothesis testing is a thought experiment

  • What would the world look like if we knew the truth?

    • Average treatment effect is \(0\)
    • Each individual effect is \(0\)
    • Sample mean is equal to \(X\)
    • etc.
  • Examples:

    • A poll shows Biden’s support at 40% now, while it was 42% before. Did support actually decrease by 2 percentage points, or is the difference purely due to chance?
    • We encourage a random sample of people to attend protests and observe that their average support for government policies is lower than among those who were not encouraged. Could this difference be due to random chance alone?
  • Hypothesis test: Assume there is no effect, determine what the data would look like in that world, and compare this to what you observe

Conducted with several steps:



  1. Pose your null and alternative hypotheses

  2. Generate the data or its distribution assuming the null is true

  3. Calculate a probability called a \(p\)-value by comparing outcomes under the null with what you observe

  4. Use the \(p\)-value to decide whether or not to reject the null hypothesis
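As a minimal sketch of these four steps (the 60-heads-in-100-flips data is hypothetical), we can simulate the null world of a fair coin and compare it to what we observe:

```r
set.seed(123)

# Step 1: H0: the coin is fair (heads probability 0.5); H1: it is not
n_flips <- 100
heads_observed <- 60  # hypothetical data

# Step 2: generate the data assuming the null is true
heads_simulated <- rbinom(10000, size = n_flips, prob = 0.5)

# Step 3: p-value = share of null-world draws at least as extreme as observed
p_value <- mean(abs(heads_simulated - n_flips/2) >= abs(heads_observed - n_flips/2))

# Step 4: reject H0 if the p-value falls below a threshold such as 0.05
p_value
```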

Null and alternative hypothesis

  • Null hypothesis: Some statement about the population parameters.

    • The “devil’s advocate” hypothesis \(\Rightarrow\) assumes that what you seek to prove is wrong
    • Ex: Biden’s approval is the same as his election result
    • Ex: Treatment effect is zero for everyone
    • Denoted \(H_0\)
  • Alternative hypothesis: The statement we hope or suspect is true instead of \(H_0\).

    • It is the opposite of the null hypothesis
    • Ex: Biden’s approval fell by 2%
    • Ex: Treatment effect is different from zero (positive or negative)
    • Denoted \(H_1\) or \(H_a\)
  • Probabilistic proof by contradiction: Try to disprove the null

Practicum example


  • Parameter: Average Treatment Effect (ATE) \(\mu_T − \mu_C\) of encouragement on support for government policies

    • \(\mu_T\): Average support for government policies if everyone received encouragement
    • \(\mu_C\): Average support for government policies if no one received encouragement
  • Goal: Learn about the difference between average support for government policies between those who were encouraged and those who were not.
  • (Sharp) Null hypothesis: No treatment effect for anyone

    • \(H_0\): \(Y_i(1) − Y_i(0) = 0\) for all \(i\)
    • \(H_1\): \(Y_i(1) − Y_i(0) \neq 0\) for at least some \(i\) (two-sided alternative)
    • In words: Do the treatment and control potential outcomes differ for anyone?
  • Other null hypothesis: No average treatment effect, \(H_0\): \(\mu_T − \mu_C = 0\)

\(P\)-value



  • \(p\)-value (based on a two-sided test): Probability of getting an (absolute) difference in means this big (or bigger) if the null hypothesis were true

    • Lower \(p\)-values \(\Rightarrow\) stronger evidence against the null
  • Intuition: How likely are we to observe what we observe if the null hypothesis is true?

  • Conclusion: We either reject (if \(p\)-value is small) or fail to reject (if \(p\)-value is large) the null

    • We never accept anything since the statement is probabilistic (more on this next time)

Observed Practicum Data

Participant \(T_i\) (Invited to protest) \(Y_i\) (Observed support) \(Y_i(0)\) (Support if not invited) \(Y_i(1)\) (Support if invited)
Participant 1 1 0.859 ??? 0.859
Participant 2 1 1.930 ??? 1.930
Participant 3 1 0.875 ??? 0.875
Participant 4 0 2.944 2.944 ???
Participant 5 0 -1.015 -1.015 ???
Participant 6 0 -0.064 -0.064 ???
Participant 7 0 1.624 1.624 ???
Participant 8 0 -0.411 -0.411 ???
Participant 9 1 1.048 ??? 1.048
Participant 10 1 -0.282 ??? -0.282
  • What do we substitute for ??? if the sharp null hypothesis is true? What about the null of no average effect?

Practicum Data under Sharp Null

Participant \(T_i\) (Invited to protest) \(Y_i\) (Observed support) \(Y_i(0)\) (Support if not invited) \(Y_i(1)\) (Support if invited)
Participant 1 1 0.859 0.859 0.859
Participant 2 1 1.930 1.930 1.930
Participant 3 1 0.875 0.875 0.875
Participant 4 0 2.944 2.944 2.944
Participant 5 0 -1.015 -1.015 -1.015
Participant 6 0 -0.064 -0.064 -0.064
Participant 7 0 1.624 1.624 1.624
Participant 8 0 -0.411 -0.411 -0.411
Participant 9 1 1.048 1.048 1.048
Participant 10 1 -0.282 -0.282 -0.282

How to calculate?

  • We can use the formula based on the CLT, as we did in the practicum

    • Note that the null here (no average effect) is slightly different from the sharp null, but this is not consequential for most applications
  • Or we can do randomization inference:

    • Assume our sample is the population and redraw the treatment assignment many times, calculating the difference-in-means each time
  • \(p\)-values produced are often very similar!

Under CLT:

# Treatment assignment: 1 = invited to protest, 0 = not invited
Z <- c(1, 1, 1, 0, 0, 
       0, 0, 0, 1, 1)

# Observed support for government policies
Y <- c(0.859, 1.930, 0.875, 2.944, -1.015, 
       -0.064, 1.624, -0.411, 1.048, -0.282)

# Observed difference in means between treated and control
dim_observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Standard error of the difference in means
se <- 
  sqrt(
    var(Y[Z == 1])/length(Y[Z == 1]) + 
      var(Y[Z == 0])/length(Y[Z == 0])
  )

# Two-sided p-value under the normal approximation (CLT)
2 * pnorm(abs(dim_observed), sd = se, lower.tail = FALSE)
[1] 0.7382426

Randomization inference:

# Same data as before
Z <- c(1, 1, 1, 0, 0, 
       0, 0, 0, 1, 1)

Y <- c(0.859, 1.930, 0.875, 2.944, -1.015, 
       -0.064, 1.624, -0.411, 1.048, -0.282)

dim_observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Under the sharp null, each unit's outcome is fixed, so we can
# re-randomize the treatment and recompute the difference in means
dim_simulated <- 
  replicate(1000, {
    Z_simulated <- sample(Z)
    mean(Y[Z_simulated == 1]) - mean(Y[Z_simulated == 0])
  })

# p-value: share of simulated differences at least as extreme as observed
mean(abs(dim_simulated) >= abs(dim_observed))
[1] 0.739
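As a quick cross-check (not in the original slides), base R’s `t.test` runs the analogous Welch two-sample test and should give a p-value in the same ballpark as both calculations above:

```r
Z <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
Y <- c(0.859, 1.930, 0.875, 2.944, -1.015,
       -0.064, 1.624, -0.411, 1.048, -0.282)

# Welch two-sample t-test of H0: equal means in treated and control
t.test(Y[Z == 1], Y[Z == 0])$p.value
```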

Testing errors and false-positives

Testing errors

  • A \(p\)-value of \(0.05\) says that data this extreme would occur in only 5% of repeated samples if the null were true.

    • In other words, our test results are not always correct!
    • If we reject at this threshold, then 5% of the time we’ll reject the null when it is actually true.
  • Test results vs reality:
\(H_0\) is True \(H_0\) is False
\(H_0\) is not Rejected Great! Type II error
\(H_0\) is Rejected Type I error Amazing!
  • Type I error (false positive) is the worst: like convicting an innocent person
  • Type II error (false negative) is less serious: we missed out on an awesome finding
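One way to see the 5% figure concretely (a simulation sketch, not from the slides): generate many datasets in which the null is true, run the same two-sided test on each, and check how often we falsely reject at the 0.05 level:

```r
set.seed(42)

# Simulate 2000 experiments where the null is true: no treatment effect
false_positive <- replicate(2000, {
  treated <- rnorm(100)   # outcomes under treatment
  control <- rnorm(100)   # outcomes under control (same distribution)
  dim <- mean(treated) - mean(control)
  se  <- sqrt(var(treated)/100 + var(control)/100)
  p   <- 2 * pnorm(abs(dim), sd = se, lower.tail = FALSE)
  p <= 0.05               # TRUE = Type I error
})

# Share of false rejections; close to 0.05 by construction
mean(false_positive)
```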

College sports 🏈 and elections 🗳️


  • “Irrelevant Events Affect Voters’ Evaluations of Government Performance” by Healy, Malhotra, and Mo (2010)

  • Summary:

    • Previous research finds that natural disasters and other unpredictable events affect support for incumbent politicians
    • The authors argue that even events completely unrelated to politics can affect an incumbent’s support
    • Consider the case of college football and basketball games
    • A mix of observational and survey-experimental evidence
    • Several robustness exercises

Digging into argument


  • Should we expect events unrelated to politics to have an effect on voting?

    • Do government investments and response matter?
    • Do psychological factors matter?
    • What else could matter?
  • Let’s break into groups and then draw a possible causal diagram

Study 1: College Football and Elections



  • What are the dependent and independent variables?

  • How are those operationalized?

  • What are the tests they are running?
  • Why do they need so many tests?

Controls


  • What does it mean to control for factors (covariates) in estimation? Why would we do that?
  • We control for factors to remove the variation in the dependent variable that they explain

    • This allows us to say that the part of the variation explained by our independent variable is not explained by the factor
    • In other words, this addresses the possibility that the factor is a confounder (i.e., causes both the independent and dependent variables)
  • Common controls: Demographics, lagged outcomes, fixed effects, etc.
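To illustrate (a hypothetical simulation, not from the paper): suppose a confounder drives both the independent and dependent variables. A regression without the control shows a spurious relationship; adding the confounder as a control removes it:

```r
set.seed(1)

n <- 1000
confounder <- rnorm(n)
x <- confounder + rnorm(n)       # independent variable, driven by confounder
y <- 2 * confounder + rnorm(n)   # outcome depends only on the confounder

# Naive regression: x looks like it matters
coef(lm(y ~ x))["x"]

# Controlling for the confounder: the coefficient on x shrinks toward zero
coef(lm(y ~ x + confounder))["x"]
```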

Robustness and Placebo tests


  • What is a robustness test?
  • The idea is to subset or augment the data so that we can run another test of the same expectation

    • Re-running analyses with covariates counts as a robustness test
    • Each additional test that supports our theory increases our confidence in that theory
  • What is a placebo? And what is a placebo test?
  • The purpose of a placebo test is to show that the variables do not vary with factors irrelevant to our theory

    • Variant 1: An irrelevant independent variable does not predict the outcome
    • Variant 2: The independent variable does not predict irrelevant outcomes

Evidence


Study 2: Survey experiment


  • What are the dependent and independent variables?

  • How are those operationalized?

  • What are the tests they are running?

    • And why do we need to run this?

Evidence


Story of false-positives 😱


  • “College Football, Elections, and False-Positive Results in Observational Research” by Fowler and Montagnes (2015)

Summary:

  • Healy, Malhotra, and Mo (2010) did a great job and the paper has been influential, BUT…
  • …their results could have been produced by chance (spurious correlation) \(\Rightarrow\) false-positive
  • The approach is to take similar data and run additional robustness checks (akin to replicating an experiment)
  • Additional robustness checks fail to support the original theory

How do they show this

  • What are the dependent and independent variables? How are they operationalized?

  • What tests do they run? Why do they first re-run the original regression?

How do they explain these results


  • Fowler and Montagnes (2015) acknowledge that Healy, Malhotra, and Mo (2010) also ran robustness/placebo tests, BUT:
  1. The placebo tests do not provide evidence against a false-positive

  2. Including controls (demographics and fixed effects) is not a fully independent test and cannot guard against a false-positive

  3. The explanation for why effects 10 days before would be stronger than 3 days before could suggest ex-post theory adjustment (wow…)

  4. They run multiple tests of each hypothesis but do not adjust for multiple comparisons (e.g., a Bonferroni correction)

  5. Championships are a bad proxy for interest, and the high-attendance results fail to replicate

💀⚰️⏹️

Broader implications for research design



  • Think about all possible implications of your theory and see whether you can test them with your own or other available data

    • This is especially important for observational studies
  • Operationalization of variables is crucial!

  • If you run many tests of the same hypothesis, you need to adjust for multiple comparisons, since you can obtain significant results just by chance
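In R, the multiple-comparisons adjustment mentioned above can be done with the base function `p.adjust` (the three p-values here are made up for illustration):

```r
# Hypothetical p-values from three tests of the same hypothesis
p_raw <- c(0.03, 0.04, 0.20)

# Bonferroni correction: multiply each p-value by the number of tests
# (capped at 1); individually "significant" results no longer survive
p.adjust(p_raw, method = "bonferroni")
# [1] 0.09 0.12 0.60
```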

References

Fowler, Anthony, and B. Pablo Montagnes. 2015. “College Football, Elections, and False-Positive Results in Observational Research.” Proceedings of the National Academy of Sciences 112 (45): 13800–13804. https://doi.org/10.1073/pnas.1502615112.
Healy, Andrew J., Neil Malhotra, and Cecilia Hyunjung Mo. 2010. “Irrelevant Events Affect Voters’ Evaluations of Government Performance.” Proceedings of the National Academy of Sciences 107 (29): 12804–12809.